{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Download cHL Dataset and Preprocessing\n",
    "\n",
    "This tutorial will walk you to prepare examples data from the cHL dataset (Shaban et al., MAPS) to use KRONOS. \n",
    "\n",
    "### Marker Metadata\n",
    "\n",
    "Before running KRONOS, we need to define marker metadata. Unlike RGB images, where channels are fixed (red, green, blue) and normalization values are often hardcoded, spatial proteomics (SP) datasets vary in the number and type of channels. Marker names also differ in formatting (e.g., “KI67” might appear as “Ki-67” or “KI-67”) To address this, KRONOS expects a CSV file containing metadata for all markers embedded in the inference data. An initial CSV with 175 markers is provided in [marker_metadata.csv](https://huggingface.co/MahmoodLab/KRONOS/blob/main/marker_metadata.csv).\n",
    "\n",
    "The marker metadata CSV includes four columns:\n",
    "- **marker_name**: The name of the marker in uppercase.\n",
    "- **marker_id**: A unique identifier assigned to the marker in the pretraining dataset.\n",
    "- **marker_mean**: The mean intensity value of the marker, calculated from a reference dataset (e.g., KRONOS pretraining dataset).\n",
    "- **marker_std**: The standard deviation intensity value of the marker, also calculated from a reference dataset.\n",
    "\n",
    "<br/>\n",
    "\n",
    "### How Marker IDs are assigned to Markers\n",
    "Marker IDs are assigned as integers from 1 to 512. In the pretrained dataset, nuclear markers are assigned IDs from 1 to 127, while non-nuclear markers receive IDs from 128 to 512. This grouping helps capture high-level similarities between markers of the same type. Within each category, markers are arranged alphabetically, but only even-numbered IDs are assigned to those included in the pretrained dataset. The odd-numbered IDs are intentionally left unassigned, reserved for biologically similar markers that were not part of the pretrained dataset. This approach allows end-users to assign marker IDs from the odd-numbered values, ensuring that any newly added markers remain closely linked to the existing structure while preserving biological relevance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Download dataset and marker metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download cHL dataset (MAPS) and marker metadata \n",
    "from huggingface_hub import hf_hub_download\n",
    "import shutil\n",
    "import os\n",
    "from utils.chl_dataset_prep import download_and_prepare_chl_maps_dataset\n",
    "\n",
    "# Set project dir \n",
    "project_dir = \"./chl_maps_dataset\"\n",
    "\n",
    "# Download and prepare cHL dataset\n",
    "download_and_prepare_chl_maps_dataset(project_dir)\n",
    "\n",
    "# Download marker_metadata.csv \n",
    "cached_file = hf_hub_download(\n",
    "    repo_id=\"MahmoodLab/KRONOS\",\n",
    "    filename=\"marker_metadata.csv\"\n",
    ")\n",
    "shutil.copy(cached_file, os.path.join(os.path.join(project_dir, \"dataset\"), \"marker_metadata.csv\"))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Matching pretraining markers with cHL data markers \n",
    "\n",
    "The following script maps the marker information (stored in `marker_info.csv`) from the original dataset to those used in the KRONOS pretraining dataset (`marker_metadata.csv`). <br />\n",
    "It also displays a list of unmatched markers along with suggestions derived from marker name similarity with entries in `marker_metadata.csv`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils import MarkerMetadata\n",
    "# Define the project directory\n",
    "project_dir = \"./chl_maps_dataset\"  # Replace with your actual project directory\n",
    "# Define paths for the dataset-specific marker info and the pretrained marker metadata files.\n",
    "marker_info_csv_path = f\"{project_dir}/dataset/marker_info.csv\"        # Path to the dataset-specific marker info file.\n",
    "marker_metadata_csv_path = f\"{project_dir}/dataset/marker_metadata.csv\"  # Path to the pretrained marker metadata file.\n",
    "top_suggestions = 5  # Number of top suggestions to display for each unmatched marker.\n",
    "\n",
    "# Create an instance of MarkerMetadata and retrieve the marker metadata.\n",
    "obj = MarkerMetadata(marker_info_csv_path, marker_metadata_csv_path, top_suggestions)\n",
    "obj.get_marker_metadata()\n",
    "\n",
    "# Display the number of markers that do not match the pretrained dataset.\n",
    "print(f\"There are {len(obj.missing_marker_dict)} markers that do not match with the markers in the pretrained dataset.\")\n",
    "\n",
    "# Show the top suggestions based on marker name similarity for each unmatched marker.\n",
    "print(f\"Below are the top {top_suggestions} marker name similarity suggestions for each missing marker:\")\n",
    "display(obj.missing_marker_df)\n",
    "\n",
    "# Display the dictionary for missing markers, which needs to be manually mapped to a biologically similar marker in marker_metadata.csv.\n",
    "print(\"The following dictionary contains missing markers that need to be manually mapped:\")\n",
    "display(obj.missing_marker_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Manual marker mapping \n",
    "\n",
    "If some markers do not match based on their names, you can manually adjust the mapping. Use the provided suggestions and/or the list of marker names in the marker_metadata.csv file. <br/>\n",
    "Simply copy the dictionary syntax from the previous step and update the values for the unmatched markers with a valid, biologically similar marker from the suggestions or the `marker_metadata.csv` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "obj.missing_marker_dict = {\n",
    "    'BCL-2': 'BCL2',\n",
    "    'CD46': '',\n",
    "    'COLLAGEN 4': 'COLLAGEN',\n",
    "    'CYTOKERITIN': 'CYTOKERATIN',\n",
    "    'DAPI-01': 'DAPI',\n",
    "    'GRANZYME B': 'GZMB',\n",
    "    'IDO-1': 'IDO1',\n",
    "    'LAG-3': 'LAG3',\n",
    "    'MMP-9': 'MMP9',\n",
    "    'MUC-1': 'MUC1',\n",
    "    'PD-1': 'PD1',\n",
    "    'PD-L1': 'PDL1',\n",
    "    'T-BET': 'TBET',\n",
    "    'TCR-G-D': 'TCR-GD',\n",
    "    'TCRB': 'TCR-B',\n",
    "    'TIM-3': 'TIM3',\n",
    "    'VISA': ''\n",
    "    }\n",
    "\n",
    "# Retrieve marker metadata using the updated mapping.\n",
    "obj.get_marker_metadata_with_mapping()\n",
    "\n",
    "if len(obj.missing_marker_dict) > 0:\n",
    "    # Display the count of markers that still do not match the pretrained dataset.\n",
    "    print(f\"There are {len(obj.missing_marker_dict)} markers that still do not match the markers in the pretrained dataset.\")\n",
    "\n",
    "    # Display the dataframe of unmatched markers.\n",
    "    display(obj.missing_marker_df)\n",
    "\n",
    "    # Display the dictionary of unmatched markers that require manual mapping.\n",
    "    display(obj.missing_marker_dict)\n",
    "else:\n",
    "    print(\"All markers have been successfully mapped to the pretrained dataset.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 (Optional): Manually Set Metadata\n",
    "If some markers are still unmatched with the pretrained dataset and you can not ignore these marker then you can manually assign their marker ID, mean, and standard deviation values:\n",
    "\n",
    "- **Marker ID**: Choose an unassigned ID from the range 1–512 in marker_metadata.csv. Ideally, select an ID close to a biologically similar marker.\n",
    "- **Mean & Std Values**: Calculate these from your dataset for the corresponding markers. Ensure marker intensities are converted to float type and intensities are in range of 0-1 before computing the mean and standard deviation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "marker_metadata_dict = {\n",
    "        'CD46': {'marker_id': 295, 'marker_mean': 0.051, 'marker_std': 0.085},\n",
    "        'VISA': {'marker_id': 45, 'marker_mean': 0.015, 'marker_std': 0.014},\n",
    "    }\n",
    "\n",
    "obj.set_marker_metadata(marker_metadata_dict)\n",
    "# Display the count of markers that still do not match the pretrained dataset.\n",
    "if len(obj.missing_marker_dict) > 0:\n",
    "    print(f\"There are {len(obj.missing_marker_dict)} markers that still do not match the markers in the pretrained dataset.\")\n",
    "    \n",
    "    # Display the dataframe of unmatched markers.\n",
    "    display(obj.missing_marker_df)\n",
    "\n",
    "    # Display the dictionary of unmatched markers that require manual mapping.\n",
    "    display(obj.missing_marker_dict)\n",
    "else:\n",
    "    print(\"All markers now have valid metadata.\")\n",
    "display(obj.marker_info)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Save Final Dataset Specific Metadata File"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_csv_path = f\"{project_dir}/dataset/marker_info_with_metadata.csv\"\n",
    "obj.export_marker_metadata(output_csv_path)\n",
    "display(obj.marker_info)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "kronos_v2",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
